1.0 Introduction and Background

During the fall of 2018, a Google questionnaire was distributed via the RPI subreddit and various Facebook groups to survey students who have completed the data structures course at RPI. The rows in the dataset represent individual students, and the columns represent attributes of each student.

The full dataset, metadata, and documentation can be found at: https://docs.google.com/spreadsheets/d/1Fz8jCMETZpIXIw_NwcYo3E5cS_kJ4E_s2tO0Ouhgo-I/edit?usp=sharing

This report was prepared by:

This report was finalized on 11/19/2018. It is generated from an R Markdown file that includes all the R code necessary to produce the results described and embedded in the report.

This document is subject to revision as more visualizations are conceived. The following sections are up to date but may change in later editions of this notebook.

Executing this R notebook requires some subset of the following packages:

These will be installed and loaded as necessary (code suppressed).
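As a sketch of the suppressed setup code (the package names in the example call are assumptions based on functions used later in this report, since the actual list is omitted here), the install-and-load pattern looks like:

```r
# Hedged sketch of the suppressed setup chunk: install any missing
# packages, then attach them. Package names below are assumptions.
load_packages <- function(pkgs) {
  for (p in pkgs) {
    # install only if the package is not already available
    if (!requireNamespace(p, quietly = TRUE)) install.packages(p)
    library(p, character.only = TRUE)
  }
}
# e.g. load_packages(c("ggplot2", "dplyr", "randomForest", "corrplot", "plotly", "caret"))
```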

2.0 Loading in the Data

The data was read in and linearly transformed as outlined in the DataScience_DataStructures.Rmd file. As such, the code is suppressed but executed.

3.0 Data Exploration

In this section we will briefly explore the distributions of the independent variables and subset the data as necessary for model construction.

3.1 Distribution

As visible in the plot below, it appears that the survey mostly had responses from individuals who are currently at RPI, which makes sense.

plot(data$class_year)

pie<-prop.table(table(data$class_year))
# collapse class years of 2017 and earlier into a single slice
slices <- c(sum(pie[1:6]), pie[7], pie[8], pie[9], pie[10], pie[11])
lbls <- c("<=2017", "2018", "2019", "2020", "2021", "2022")
pct <- round(slices/sum(slices)*100)
lbls <- paste0(lbls, " ", pct, "%")
pie(slices,labels=lbls,col=rainbow(length(lbls)),main="Pie Chart of Class Year Distribution, n=653")

I expected more people to have taken DS in the spring than in any other term; however, I did not expect the gap to be as small as it appears below.

plot(data$semester,main="Distribution of the Semester DS was taken")

As visible below, the distribution of AP scores is mostly flat, with an artificial peak at 0, which represents students who did not take the AP exam.

#hist(data$ap_grade,main="Histogram of AP Computer Science Grade",xlab = "Grade")
ggplot(data, aes(x=ap_grade)) + geom_histogram()+ggtitle("Distribution of AP Computer Science Grade")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

As you can see below, the distribution of cumulative GPA is multimodal, with the highest frequency in the 3.0 to 4.0 range. It is possible that some people entered an unrealistic GPA, so additional filtering is required. I removed the samples with a GPA below 2.0, as those students are not in good academic standing. On the same note, 185 students are significantly below the good-standing threshold (significantly below is defined here as a GPA under 1.5). These students either are not performing well or provided false GPA information; if the GPA is false, it is probable that false information was also recorded for the data structures grade. In either case, we will not consider them for model creation. Henceforth, any generalization refers to the subset of students with a cumulative GPA greater than or equal to 2.0.

ggplot(data, aes(x=gpa)) + geom_histogram()+ggtitle("Distribution of Cumulative GPA")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

data_lt1.5<-data[data$gpa<1.5,]
data_gt2<-data[data$gpa>=2,]
cat("Number of students with a GPA below 1.5: ",nrow(data_lt1.5))
## Number of students with a GPA below 1.5:  185
cat("Number of students with a GPA above 2.0: ",nrow(data_gt2))
## Number of students with a GPA above 2.0:  396

As visible below, a large proportion of students have little to no experience with the command line prior to starting the course. This is relevant because compiling and debugging C++ is done at the command line.

barplot(table(data_gt2$prompt_lines),main="Distribution of Command Line Quantities")

Similarly, a large proportion of students have little experience with C++. I hoped that this variable would have a strong correlation with the response variable, but that appears unlikely.

barplot(table(data_gt2$c_lines),main="Distribution of C++ Line Quantities")

As visible below, the majority of students attended lecture twice a week, which is the maximum. If this variable is a significant predictor, it could be extrapolated that there is a significant relationship between class attendance and the final grade. If it is not a significant predictor, it could be extrapolated that class attendance does not significantly determine the final grade of a student in data structures.

pie<-prop.table(table(data_gt2$lectures_week))
slices <- c(pie[1],pie[2],pie[3]) 
lbls <- c("0 lectures per week", "1 lecture per week", "2 lectures per week")
pct <- round(slices/sum(slices)*100)
lbls <- paste(lbls, pct) 
lbls <- paste(lbls,"%",sep="") 
pie(slices,labels=lbls,col=rainbow(length(lbls)),main="Pie Chart of Lectures Attended Distribution, n=396")

Below is another representation of the same variable.

barplot(table(data_gt2$lectures_week),main="Distribution of The number of lectures attended by week")

3.2 Outlier Detection

The boxplots below were created in an attempt to visualize the number of outliers with respect to the response variable. As you can see, there are very few outliers, denoted in orange.

ggplot(data=data_gt2,aes(x=ds_grade_letter,y=gpa)) + geom_boxplot(varwidth=TRUE, fill="white",outlier.color="orange") + ggtitle("Boxplot of GPA & DS Grade")+labs(x="DS Grade",y="GPA") + coord_flip()

3.3 Change in Major Distribution

As time progresses, it is possible that more students outside Computer Science and ECSE are taking the course. It is desirable for students to know whether there is any trend in the number of students within their discipline who take the course. This section investigates the possibility of a difference in the distribution with respect to time.

3.3.1 Major Cleanup

The code chunk below cleans the major entries. Dual majors that include CSCI are denoted as CSCI, since their designation as a CSCI major explains their enrollment in the course. The sample size of dual majors is not large enough to draw any additional conclusions, so this loss of information is acceptable for our purposes.

toPlot<-data_gt2
# dual majors that include CSCI collapse to CSCI
csci_duals<-c("BIOL/CSCI at the time, eventually swapped to just CSCI","COGS/CSCI",
  "CS and GSAS","Physics and Computer Science","CSCI and MECL","CS / ITWS Dual",
  "CS/CSE","CSCI/CSE","CSCI/GSAS","CSYS+CSCI","MECL, CSCI","ECSE/CSCI","CSE/CS",
  "CSCI/PSYC","CSCI/ITWS","ECSE and CSCI","CSE CS","CSYS/CSCI","CSCI and STSS",
  "CSCI COGS","CSCI/COGS","ECSE / CSCI","GSAS/CSCI","Gsas csci","CSCI/MATH",
  "Itws & csci","CSCI/ECSE","MATH/CSCI","CSCS")
reassign<-as.character(toPlot$major)
reassign[reassign %in% csci_duals]<-"CSCI"
reassign[reassign %in% c("PHYS/MATH Dual","Phys-Math")]<-"PHYS"
reassign[reassign=="ECSE/CSYS"]<-"ECSE"
reassign[reassign=="COGS/GSAS"]<-"COGS"
reassign[reassign=="ChemE"]<-"CHME"
toPlot$major<-reassign

The chunk below combines the semester and year data structures was taken into a single variable for plotting.

toPlot$semester<-as.factor(toPlot$semester)
toPlot$ds_year<-as.factor(toPlot$ds_year)
toPlot<-toPlot %>% group_by(semester,ds_year,major) %>%
  summarize(n=n_distinct(survey_id))
# combine year and semester into a single term label, e.g. "2017 Spring"
toPlot$SemYr<-paste(as.character(toPlot$ds_year),as.character(toPlot$semester))

3.3.2 3D Plot

The chunk below displays the number of terms that have data for each major. For example, there are 21 distinct terms in which CSCI students took data structures. Unfortunately, there is not enough data for a deep analysis of most disciplines.

table(toPlot$major)
## 
## AERO ARCH BFMB BIOL CHEM CHME COGS CSCI DSIS ECSE Engr GSAS ITWS MANE MATH 
##    1   11    1   12   12    1   11   21    1   13    1   11   15    1   13 
## MECH PHIL PHYS PSYC 
##    1   10   11   13
barplot(table(toPlot$major),main="Distribution of Major")

The chunk below creates an interactive 3D plot using plotly. The x axis is the semester the course was taken, the y axis is the year, and the z axis is the number of students in that term and major; color denotes the major. As you can see, the most represented major is CSCI, which is logical. This plot serves as an excellent deliverable, as students can investigate the distribution for their own major.

NOTE: Engr represents undeclared Engineering students.

toPlot$SemYr<-as.factor(toPlot$SemYr)
toPlot <- toPlot[order(toPlot$SemYr),]
colors = rainbow(length(unique(toPlot$major)))
names(colors) = unique(toPlot$major)
plot_ly(toPlot, x = ~semester, y = ~ds_year, z = ~n, color = ~major, colors = colors) %>%
  add_markers() %>%
  layout(title = 'Major Distribution with respect to Time', scene = list(xaxis = list(title = 'Semester DS was taken'),
                     yaxis = list(title = 'Year DS was taken'),
                     zaxis = list(title = 'Number of Students'))) 

In general, there are not enough samples from older years to see a trend. Clearly, fewer students take the course in the summer, but that is not a new insight. This plot could be extended to other courses, and it would be more effective with better data.

4.0 Response Variable Manipulation

In this section, the response variable is transformed into the following two variables:

- binary - a Boolean that represents whether or not the student passed the course.
- response - a factor representing the grade neighborhoods, with P omitted and with No Credit and W treated as failing.

4.1 Code

Below are the two code segments that perform the transformations.

# creating smaller bins; P removed since it could represent anything from a D to an A
data_no_P<-data_gt2[data_gt2$ds_grade_letter!="P",]
reassign2<-as.character(data_no_P$ds_grade_letter)
# treat No Credit and W as failures
reassign2[reassign2 %in% c("No Credit","W")]<-"F"
table(reassign2)
## reassign2
##   A   B   C   D   F 
##  98 100  85  44  50
morphedData<-data_no_P

morphedData["response"] <- reassign2
dropnames1<-c("dropped","ds_multiple","employed","employed_field","workload","RCS","survey_id","ds_grade","ds_grade_letter")
morphedData<-morphedData[,!(names(morphedData)%in% dropnames1)]
morphedData$ap_bool<-as.factor(morphedData$ap_bool)
morphedData$hrs_hw<-as.factor(morphedData$hrs_hw)
morphedData$hrs_test<-as.factor(morphedData$hrs_test)
morphedData$prompt_lines<-as.factor(morphedData$prompt_lines)
morphedData$cs1_atRpi<-as.factor(morphedData$cs1_atRpi)
morphedData$c_lines<-as.factor(morphedData$c_lines)
morphedData<-morphedData[complete.cases(morphedData), ]
morphedData$response<-as.factor(morphedData$response)
# creating binary: letter grades count as passing, No Credit and W as failing
Legend<-as.character(morphedData$response)
Legend[Legend %in% c("A","B","C","D")]<-"P"
Legend[Legend %in% c("No Credit","W")]<-"F"




reassign<-as.character(data_gt2$ds_grade_letter)
# letter grades count as passing, No Credit and W as failing
reassign[reassign %in% c("A","B","C","D")]<-"P"
reassign[reassign %in% c("No Credit","W")]<-"F"
table(reassign)
## reassign
##   F   P 
##  50 346
binaryData<-data_gt2

binaryData["binary"] <- reassign
dropnames1<-c("dropped","ds_multiple","employed","employed_field","workload","RCS","survey_id","ds_grade","ds_grade_letter")
binaryData<-binaryData[,!(names(binaryData)%in% dropnames1)]
binaryData$binary<-as.factor(binaryData$binary)
binaryData$ap_bool<-as.factor(binaryData$ap_bool)
binaryData$hrs_hw<-as.factor(binaryData$hrs_hw)
binaryData$hrs_test<-as.factor(binaryData$hrs_test)
binaryData$prompt_lines<-as.factor(binaryData$prompt_lines)
binaryData$cs1_atRpi<-as.factor(binaryData$cs1_atRpi)
binaryData$c_lines<-as.factor(binaryData$c_lines)
binaryData<-binaryData[complete.cases(binaryData), ]

4.2 Response Variable Distribution

As you can see below, the response variable is fairly well distributed. Note that the neighborhoods were created to condense the number of classes; for example, the C neighborhood represents C+, C, and C- grades. Withdrawals and No Credits are treated as failures, and Passes are omitted. Ideally, there would be a higher proportion of failures.

pie<-prop.table(table(morphedData$response))
slices <- c(pie[1], pie[2], pie[3], pie[4], pie[5]) 
lbls <- c("A Neighborhood", "B Neighborhood", "C Neighborhood", "D Neighborhood", "F Neighborhood")
pct <- round(slices/sum(slices)*100)
lbls <- paste(lbls, pct) 
lbls <- paste(lbls,"%",sep="") #for gpa >=2
pie(slices,labels=lbls,col=rainbow(length(lbls)),main="Pie Chart of DS Grade Neighborhoods, n=396")

As visible below, the binary response variable is unfortunately not evenly distributed, so the sampling techniques will deviate from the norm.

pie<-prop.table(table(binaryData$binary))
slices <- c(pie[1], pie[2]) 
lbls <- c("Fail", "Pass")
pct <- round(slices/sum(slices)*100)
lbls <- paste(lbls, pct) 
lbls <- paste(lbls,"%",sep="") #for gpa >=2
pie(slices,labels=lbls,col=rainbow(length(lbls)),main="Pie Chart of binary response variable, n=396")

4.3 Correlation & PCA

The section below investigates feature selection and the correlation of the response variable with the independent variables.

4.3.1 Correlation

As you can see in the correlation plot below, there are no strong correlations between the response variable and the independent variables. Note that the high-correlation clusters around the class-year and AP variables are logical.

corM<-cor(binaryDataKnn,method=c("pearson","kendall","spearman"))
corrplot(corM, type = "full", order = "hclust", tl.col = "black", tl.srt = 45)
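One caveat worth noting: `cor()`'s `method` argument accepts a single method, so supplying the full vector `c("pearson","kendall","spearman")` is reduced by `match.arg` to the first entry, Pearson. A toy check on synthetic data (not the survey data) illustrates this:

```r
# Toy check on synthetic data: passing all three method names to cor()
# yields the same result as method = "pearson".
set.seed(2)
m <- cbind(a = rnorm(50), b = rnorm(50))
full <- cor(m, method = c("pearson", "kendall", "spearman"))
pear <- cor(m, method = "pearson")
identical(full, pear)  # TRUE
```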

4.3.2 PCA

As you can see below, the first two principal components capture ~87% of the variance in the samples. The first principal component is loaded almost entirely on major, which makes sense, as major has the widest numeric range of all recorded variables (the data were not standardized before PCA).

pca<-prcomp(binaryDataKnn)
summary(pca)#PC1+2 is ~87%
## Importance of components:
##                            PC1    PC2     PC3     PC4     PC5     PC6
## Standard deviation     17.0382 5.1869 3.98733 2.88726 2.16714 1.85418
## Proportion of Variance  0.7931 0.0735 0.04344 0.02278 0.01283 0.00939
## Cumulative Proportion   0.7931 0.8666 0.91008 0.93286 0.94569 0.95508
##                            PC7     PC8     PC9   PC10    PC11    PC12
## Standard deviation     1.74457 1.63269 1.55982 1.3392 1.31539 1.18641
## Proportion of Variance 0.00832 0.00728 0.00665 0.0049 0.00473 0.00385
## Cumulative Proportion  0.96340 0.97068 0.97733 0.9822 0.98695 0.99080
##                           PC13    PC14   PC15    PC16    PC17    PC18
## Standard deviation     1.08720 0.78340 0.7147 0.53776 0.48414 0.44688
## Proportion of Variance 0.00323 0.00168 0.0014 0.00079 0.00064 0.00055
## Cumulative Proportion  0.99403 0.99571 0.9971 0.99789 0.99853 0.99908
##                           PC19   PC20
## Standard deviation     0.43639 0.3837
## Proportion of Variance 0.00052 0.0004
## Cumulative Proportion  0.99960 1.0000
pca$rotation[,1:2]#pc1 based on major, pc2 based on cs1 grade variants.
##                                  PC1          PC2
## class_year              0.0021595865  0.009805713
## semester                0.0004274521  0.012390251
## lecturer                0.0019747682  0.074292699
## ap_bool                -0.0013520532 -0.021088891
## ap_grade               -0.0033435587 -0.135711141
## cs1_grade_rpi          -0.0071332463 -0.756519507
## gpa                    -0.0005269032  0.014024555
## major                   0.9995352718  0.005999246
## hrs_test                0.0040204317  0.015425616
## hrs_hw                  0.0050965513  0.061478608
## age                    -0.0038635489 -0.015628904
## ds_year                 0.0054487997  0.045095701
## cs1_atRpi               0.0001582264  0.030784571
## cs1_grade_other        -0.0227254413  0.450326327
## c_lines                 0.0056845833 -0.071029530
## prompt_lines           -0.0037742890 -0.042814061
## lectures_week          -0.0021207888  0.018749266
## cs1_grade_rpi_letter   -0.0014150186 -0.360098765
## cs1_grade_other_letter -0.0109663383  0.235905303
## response                0.0089319544 -0.015002537
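Because `prcomp` was run on unscaled data, the variable with the widest numeric range dominates PC1, exactly as observed for `major` above. A toy demonstration on synthetic data (not the survey data) shows the effect, and how `scale. = TRUE` changes it:

```r
# Synthetic demonstration: unscaled PCA is dominated by the wide-range column;
# standardizing with scale. = TRUE spreads the variance across components.
set.seed(1)
x <- data.frame(wide = rnorm(100, sd = 100), narrow = rnorm(100, sd = 1))
raw    <- prcomp(x)                 # unscaled, as in the report
scaled <- prcomp(x, scale. = TRUE)  # standardized alternative
abs(raw$rotation["wide", "PC1"])    # essentially 1: PC1 is just 'wide'
summary(scaled)$importance["Proportion of Variance", "PC1"]  # roughly half
```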

The plot below projects the samples onto the subspace spanned by the first two principal components. The samples are colored based on the response variable: F denotes failure and P denotes passing. As you can see, there is no natural clustering despite the large proportion of variance explained. This plot suggests that clustering methods such as K-means and KNN will likely fail, so they will not be tested if the other visualizations follow a similar pattern.

#PCA 
p1<-ggplot(binaryDataKnn,aes(x=pca$x[,1],y=pca$x[,2],colour=Legend))+geom_point(size=2)
p1+ ggtitle("PCA Denoted by Pass/Fail") +
  xlab("PC1") + ylab("PC2")

As you can see below, there is still no clustering when the binary constraint is relaxed to a wider range of grades.

#PCA 
Legend<-as.factor(morphedData$response)
p2<-ggplot(binaryDataKnn,aes(x=pca$x[,1],y=pca$x[,2],colour=Legend))+geom_point(size=2)
p2+ ggtitle("PCA Denoted by DS Final Grade Neighborhood") +
  xlab("PC1") + ylab("PC2")

5.0 Training, Testing, and Model Creation

In this section, the samples will be logically separated into training and testing sets. The training set will be used for model construction, and the testing set will be used to evaluate the performance of the model. There is no overlap between the training and testing sets.

5.1 Training and Testing Sets

The code section below separates the samples into training and testing sets. Normally, I would cross-validate the training and testing sets; however, the distribution of the binary response variable is far from uniform. The training set comprises all of the failures (n=50) plus 60 randomly sampled passing students. The testing set contains the remaining samples.

#for binary classifier
set.seed(162)
ss<-as.integer(.80*nrow(binaryData))
value<-sample(1:nrow(binaryData),ss)
traindata<-binaryData[binaryData$binary=="F",]
tempdata<-binaryData[binaryData$binary!="F",]
ss<-60
value<-sample(1:nrow(tempdata),ss)
traindata<-rbind(traindata,tempdata[value,])# the 60 sampled rows come from the passing subset
testdata<-tempdata[-value,]# remaining passing samples, disjoint from training
#traindata<-binaryData[value,]
#testdata<-binaryData[-value,]
traindata$ap_bool<-as.factor(traindata$ap_bool)
traindata$hrs_hw<-as.factor(traindata$hrs_hw)
traindata$hrs_test<-as.factor(traindata$hrs_test)
traindata$prompt_lines<-as.factor(traindata$prompt_lines)
traindata$cs1_atRpi<-as.factor(traindata$cs1_atRpi)
traindata$c_lines<-as.factor(traindata$c_lines)
#train mods
testdata$ap_bool<-as.factor(testdata$ap_bool)
testdata$hrs_hw<-as.factor(testdata$hrs_hw)
testdata$hrs_test<-as.factor(testdata$hrs_test)
testdata$prompt_lines<-as.factor(testdata$prompt_lines)
testdata$cs1_atRpi<-as.factor(testdata$cs1_atRpi)
testdata$c_lines<-as.factor(testdata$c_lines)

The section below separates the data into training and testing sets via a random 80/20 split for the neighborhood response variable.

set.seed(162)
ss<-as.integer(.80*nrow(morphedData))
value<-sample(1:nrow(morphedData),ss)
traindataM<-morphedData[value,]
testdataM<-morphedData[-value,]
dropnames<-c("dropped","ds_multiple","employed","employed_field","workload","RCS","survey_id","ds_grade")#,"lectures_week"
traindataM<-traindataM[,!(names(traindataM)%in% dropnames)]
traindataM$ap_bool<-as.factor(traindataM$ap_bool)
traindataM$hrs_hw<-as.factor(traindataM$hrs_hw)
traindataM$hrs_test<-as.factor(traindataM$hrs_test)
traindataM$prompt_lines<-as.factor(traindataM$prompt_lines)
traindataM$cs1_atRpi<-as.factor(traindataM$cs1_atRpi)
traindataM$c_lines<-as.factor(traindataM$c_lines)
testdataM<-testdataM[,!(names(testdataM)%in% dropnames)]
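The split above is a plain random 80/20 draw. If a truly stratified split (preserving the response-class proportions in both sets) were wanted, a base-R sketch on synthetic labels (not the survey data) might look like:

```r
# Stratified 80/20 split sketch on synthetic labels:
# sample 80% of the indices within each class separately.
set.seed(162)
y <- factor(rep(c("A", "B", "C", "D", "F"), times = c(40, 40, 35, 20, 25)))
train_idx <- unlist(lapply(split(seq_along(y), y),
                           function(ix) sample(ix, floor(0.8 * length(ix)))))
train_y <- y[train_idx]
test_y  <- y[-train_idx]
table(train_y)  # per-class counts: 32 32 28 16 20
```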

5.2 Model Construction - Random Forest

In this section, random forests are used to predict the response variables.

5.2.1 Random Forest - Binary

The line below creates a random forest model in an attempt to predict whether or not a student will pass the course.

#random forest for binary - OOB ~45%
randFor<-randomForest(binary~.,data=traindata)

The code below displays the results of the model. In the random forest error plot, the black line represents the average of the per-class error rates. An approximate 45% error rate is acceptable in training, and the class-error column suggests the model did not badly overfit. Unfortunately, the testing set is not ideal given the class imbalance. In testing, the overall accuracy was approximately 80%.

print("Training Results")
## [1] "Training Results"
plot(randFor,main="Random Forest Error")

print(randFor)
## 
## Call:
##  randomForest(formula = binary ~ ., data = traindata) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 4
## 
##         OOB estimate of  error rate: 43.64%
## Confusion matrix:
##    F  P class.error
## F 25 31   0.5535714
## P 17 37   0.3148148
forBinaryTable<-table(predict(randFor,testdata),as.factor(testdata$binary))
forBinaryTable
##    
##       F   P
##   F  44  67
##   P   0 225
print("Testing Results")
## [1] "Testing Results"
cat("Proportion of Accuracy rates in Testing: ",diag(prop.table(forBinaryTable,1)))
## Proportion of Accuracy rates in Testing:  0.3963964 1
cat("\nOverall Accuracy in Testing: ",(sum(diag(prop.table(forBinaryTable))))*100,"%")
## 
## Overall Accuracy in Testing:  80.05952 %
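The accuracy figures above are derived from the confusion matrix; the calculation can be illustrated on a toy matrix (synthetic predictions, not the model output):

```r
# Toy illustration of the accuracy calculations used above
# (rows = predicted class, columns = actual class).
pred   <- factor(c("P", "P", "F", "P", "F", "P"), levels = c("F", "P"))
actual <- factor(c("P", "F", "F", "P", "F", "P"), levels = c("F", "P"))
cm <- table(pred, actual)
per_class <- diag(prop.table(cm, 1))    # share of each predicted class that is correct
overall   <- sum(diag(prop.table(cm)))  # overall accuracy: 5/6
```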

The variable importance plot below displays the importance of the various features in decreasing order. The most important feature is the student's grade in CS1 at RPI, which makes sense, as the courses are similar.

varImpPlot(randFor,sort=T,n.var=9,main="Variable Importance - Forest")

5.2.2 Random Forest - Neighborhood

As you can see below, the random forest model did not perform well on the neighborhood response. An estimated 72% error rate is only modestly better than random guessing among five classes. It is unlikely that we can predict the final grade neighborhoods, as the independent variables are not closely correlated with the response variable.

randFor<-randomForest(response~.,data=traindataM)
plot(randFor,main="Random Forest error")

print(randFor)
## 
## Call:
##  randomForest(formula = response ~ ., data = traindataM) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 4
## 
##         OOB estimate of  error rate: 72.43%
## Confusion matrix:
##    A  B  C D  F class.error
## A 40 14 12 2 10   0.4871795
## B 17 23 26 5  8   0.7088608
## C  9 31 12 2 11   0.8153846
## D  4 14 14 1  2   0.9714286
## F  7 19 10 1  7   0.8409091
varImpPlot(randFor,sort=T,n.var=9,main="Variable Importance - Forest")

5.3 Model Construction - Neural Network

The commented code below creates a neural network. For the sake of brevity when knitting this notebook, the R data object was saved and is merely reloaded.

set.seed(6432)
#trainc<-trainControl(method="repeatedcv",number=10,repeats=3)
#NN<-train(response~.,data=traindataM,method="nnet",tuneLength=10,trainControl=trainc)
#save(NN,file="C:/Users/JHicks/Desktop/DataScience/NN.Rda")
NN<-get(load(file="C:/Users/JHicks/Desktop/DataScience/NN.Rda"))

As you can see in the confusion matrix below (computed on the training data, since the saved model is evaluated against traindataM), the neural network performs poorly at classifying the specific grade neighborhoods. Taken together with the random forest results, this suggests the independent variables are a useful predictor for the A neighborhood at minimum. There are also concerns of overfitting due to the sample size.

testpNN<-predict(NN,newdata=traindataM)# evaluated against the training data
NNtable<-table(testpNN,traindataM$response)
NNtable
##        
## testpNN  A  B  C  D  F
##       A 71  6  6  0  0
##       B  3 56 46 19 33
##       C  0  0  0  0  0
##       D  0  0  0  0  0
##       F  4 17 13 16 11
diag(prop.table(NNtable,1))# proportion correct within each predicted class
##         A         B         C         D         F 
## 0.8554217 0.3566879       NaN       NaN 0.1803279
cat("Overall proportion correct: ",sum(diag(prop.table(NNtable))))
## Overall proportion correct:  0.4584718

6.0 Conclusion

The distribution of the response variables could have been more uniform, which could have produced better models. I was hopeful that there would be natural clustering so that we could attempt KNN or K-means.

The random forest binary classifier performs at an acceptable level; an OOB error rate of ~44% in training and an overall accuracy of ~80% in testing is adequate to assist students in their college decisions. Obviously, the testing accuracy is high largely because of the high number of passing samples. These models can be further validated with data collected in later semesters, to verify the claims made and to account for any potential overfitting.

There is a significant relationship between the independent variables and the response variables, as indicated by the moderately successful models.